Regression analysis is a simple yet powerful method for finding relationships within a dataset. In this post, we will look at the insurance charges data obtained from Kaggle (https://www.kaggle.com/mirichoi0218/insurance/home). The data set consists of 7 columns: age, sex, bmi, children, smoker, region, and charges. We will get into more detail about these variables later.
The key questions that we would be asking are:
- Is there a relationship between the predictors and the charges?
- If so, which predictors actually matter?
- How well does a linear model fit the data, and are its assumptions satisfied?
We start by importing the required libraries:
library(magrittr)
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:magrittr':
##
## set_names
library(MASS)
library(car)
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
library(broom)
## Warning: package 'broom' was built under R version 3.4.4
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
library(psych)
## Warning: package 'psych' was built under R version 3.4.4
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## The following object is masked from 'package:car':
##
## logit
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
We import the data from the CSV file. We can see an overview of the data using the summary() function.
insurance <- read.csv('insurance.csv')
summary(insurance)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
The key points that can be taken from the summary are:
- Age ranges from 18 to 64, with a median of 39.
- The sexes are nearly balanced (662 female, 676 male).
- The mean BMI (30.66) sits just above the clinical threshold for obesity (30).
- About 20% of the individuals (274 of 1338) are smokers.
- Charges are right-skewed: the mean (13270) is well above the median (9382).
Linear regression follows the formula:

Y = β0 + β1X1 + β2X2 + … + βpXp + ε

where Y is the response, the Xi are the predictors, the βi are the coefficients, and ε is the irreducible error.
The coefficients in this linear equation quantify the additive relation between each predictor and the response.
As such, the null hypothesis is that there is no relation between any of the predictors and the response, i.e. all the predictor coefficients are 0. The alternative hypothesis is that at least one of the predictors is related to the outcome, i.e. at least one coefficient is non-zero.
This hypothesis is tested by computing the F-statistic. If there is no relationship between the predictors and the response, the F-statistic will be close to 1. If the alternative hypothesis is true, the F-statistic will be greater than 1. The p-value of the F-statistic can be calculated using the number of records (n) and the number of predictors (p), and can then be used to determine whether the null hypothesis can be rejected.
We will start with fitting a multiple linear regression model using all the predictors:
lm.fit <- lm(formula = charges~., data = insurance)
#charges~. is the formula being used for linear regression. Here '.' means all the predictors in the dataset.
summary(lm.fit)
##
## Call:
## lm(formula = charges ~ ., data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## sexmale -131.3 332.9 -0.394 0.693348
## bmi 339.2 28.6 11.860 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
A high F-statistic with a significant p-value (< 2.2e-16) implies that the null hypothesis can be rejected. This means there is a potential relationship between the predictors and the outcome.
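As a quick sanity check, we can reproduce this p-value directly from the F distribution, using the n, p, and F values reported in the summary above:

```r
# Under the null hypothesis, the F-statistic follows an F(p, n - p - 1) distribution.
n <- 1338      # number of observations
p <- 8         # non-intercept model coefficients
f_stat <- 500.8

# Upper-tail probability = p-value of the overall F-test
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
p_value  # effectively zero, matching the summary output
```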
RSE (Residual Standard Error) is an estimate of the standard deviation of the irreducible error. In simpler words, it is the average amount by which the actual outcomes deviate from the fitted regression line. A large RSE therefore means a large deviation from the true regression line, which makes RSE useful for assessing lack of fit. The RSE of our model is large (6062), indicating that the model doesn't fit the data closely.
R-squared measures the proportion of variability in Y that can be explained by X, and always lies between 0 and 1. A fairly high value (0.7509, or 0.7494 adjusted) shows that around 75% of the variance in charges is explained by the model.
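Both quantities can be computed by hand from a fitted model. A minimal sketch on simulated data (not the insurance set):

```r
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)          # residual sum of squares
tss <- sum((y - mean(y))^2)           # total sum of squares
rse <- sqrt(rss / df.residual(fit))   # residual standard error
r2  <- 1 - rss / tss                  # R-squared

all.equal(rse, sigma(fit))            # matches summary(fit)
all.equal(r2, summary(fit)$r.squared)
```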
If we look at the p-values of the estimated coefficients above, we see that not all of them are statistically significant. This suggests that only a subset of the predictors is related to the outcome. The question is which ones.
We can look at the individual p-values to select variables. This is reasonable here because the number of predictors (7) is small compared to the number of observations (1338). This method breaks down, however, when the number of predictors exceeds the number of observations. In such cases, we would have to use feature/variable selection methods, like forward selection, backward selection, or mixed selection. Before jumping to feature selection using any of these methods, let us try regression using only the features with significant p-values.
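For reference, here is a forward-selection sketch with MASS::stepAIC, illustrated on the built-in mtcars data rather than the insurance set:

```r
library(MASS)

# Start from the intercept-only model and add predictors one at a time,
# keeping each addition only if it lowers the AIC.
null.fit <- lm(mpg ~ 1, data = mtcars)
full.fit <- lm(mpg ~ ., data = mtcars)
fwd.fit  <- stepAIC(null.fit, scope = formula(full.fit),
                    direction = "forward", trace = FALSE)
formula(fwd.fit)  # the selected subset of predictors
```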
lm.fit.sel <- lm(charges~age+bmi+children+smoker+region, data = insurance)
summary(lm.fit.sel)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11367.2 -2835.4 -979.7 1361.9 29935.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11990.27 978.76 -12.250 < 2e-16 ***
## age 256.97 11.89 21.610 < 2e-16 ***
## bmi 338.66 28.56 11.858 < 2e-16 ***
## children 474.57 137.74 3.445 0.000588 ***
## smokeryes 23836.30 411.86 57.875 < 2e-16 ***
## regionnorthwest -352.18 476.12 -0.740 0.459618
## regionsoutheast -1034.36 478.54 -2.162 0.030834 *
## regionsouthwest -959.37 477.78 -2.008 0.044846 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
## F-statistic: 572.7 on 7 and 1330 DF, p-value: < 2.2e-16
We will compare this to mixed (stepwise) selection, which combines forward and backward selection:
step.lm.fit <- stepAIC(lm.fit, direction = "both", trace = FALSE)
summary(step.lm.fit)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11367.2 -2835.4 -979.7 1361.9 29935.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11990.27 978.76 -12.250 < 2e-16 ***
## age 256.97 11.89 21.610 < 2e-16 ***
## bmi 338.66 28.56 11.858 < 2e-16 ***
## children 474.57 137.74 3.445 0.000588 ***
## smokeryes 23836.30 411.86 57.875 < 2e-16 ***
## regionnorthwest -352.18 476.12 -0.740 0.459618
## regionsoutheast -1034.36 478.54 -2.162 0.030834 *
## regionsouthwest -959.37 477.78 -2.008 0.044846 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
## F-statistic: 572.7 on 7 and 1330 DF, p-value: < 2.2e-16
The model given by stepwise selection is the same as the model we got by selecting predictors with significant p-values; so the simple method of selecting coefficients on the basis of p-values works in this case.
We can see a very slight improvement in the adjusted R-squared (0.7494 -> 0.7496) and in the RSE (6062 -> 6060).
Some key inferences to be taken from the model are:
- Smoking has by far the largest association with charges: smokers pay about $23,836 more than non-smokers, all else equal.
- Each additional year of age adds about $257, and each unit of BMI about $339, to the expected charges.
- Each additional child is associated with about $475 higher charges.
- Sex is not a significant predictor and was dropped from the model.
By applying linear regression, we are assuming that there is a linear relationship between the predictors and the outcome. If the underlying relationship is far from linear, then most of the inferences we have made so far are doubtful, and the model's predictive accuracy will suffer.
Non-linearity can be detected using residual plots. For multiple linear regression, we plot the residuals versus the fitted values. The presence of a pattern in the residual plot would indicate a problem with the linearity assumption.
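A base-R version of this diagnostic, sketched on simulated data (not the insurance set):

```r
set.seed(7)
x <- runif(100)
y <- 3 + 2 * x + rnorm(100, sd = 0.3)
fit <- lm(y ~ x)

# Residuals vs fitted values: a patternless cloud around zero supports linearity
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```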
residualPlot(step.lm.fit, type = "pearson", id=TRUE)
The blue line is a smooth quadratic fit of the residuals against the fitted values. The curve is quite close to a straight line, indicating that the underlying data approximately follows linearity. (The numbers 1301 and 578 label flagged observations; we'll get to them later.)
We can further plot the residuals against the individual predictors to see if any of them demonstrates non-linearity.
residualPlots(step.lm.fit)
## Test stat Pr(>|Test stat|)
## age 3.8860 0.000107 ***
## bmi -2.2167 0.026811 *
## children -0.9573 0.338601
## smoker
## region
## Tukey test 11.8537 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The plots show no strong visual pattern for the individual predictors, although the curvature tests for age (p ≈ 1e-4) and bmi (p ≈ 0.027), as well as the Tukey test, are statistically significant, hinting at some mild non-linearity.
par(mfrow=c(2,2))
plot(step.lm.fit)
An important assumption of the linear regression model is that consecutive error terms are uncorrelated; the standard errors of the estimated regression coefficients are calculated on this basis. If the error terms are correlated, the true standard errors may be much larger than the estimated ones.
We can check for auto-correlation of the residuals using the Durbin-Watson test. The null hypothesis is that the residuals have no auto-correlation; the alternative hypothesis is that they have a statistically significant auto-correlation:
set.seed(1)
# Test for Autocorrelated Errors
durbinWatsonTest(step.lm.fit, max.lag = 5, reps=1000)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.04582739 2.088964 0.104
## 2 -0.03919652 2.075623 0.138
## 3 -0.02781912 2.052635 0.322
## 4 0.02815118 1.933913 0.268
## 5 -0.01764343 2.025369 0.582
## Alternative hypothesis: rho[lag] != 0
Here we check for auto-correlation of the residuals at 5 different lags. None of the lags has a p-value below 0.05, so we cannot reject the null hypothesis of no auto-correlation.
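The lag-1 Durbin-Watson statistic itself is straightforward to compute by hand; for uncorrelated residuals it sits near 2:

```r
set.seed(1)
e <- rnorm(200)  # stand-in residuals with no autocorrelation
# DW = sum of squared successive differences over sum of squared residuals
dw <- sum(diff(e)^2) / sum(e^2)
dw  # close to 2, as expected for uncorrelated errors
```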
res <- step.lm.fit$residuals %>%
tidy
res$names <- as.numeric(res$names)
res%>%
ggplot +
geom_point(aes(x=names, y=x)) +
labs(x='index', y='residuals')
Constant variance of the residuals (homoscedasticity) is another assumption of a linear regression model. The error variance may, for instance, change with the value of the response variable. One way to spot non-constant error variance is a funnel shape in the residual plot. A more formal check is an extension of the Breusch-Pagan test, available in R as ncvTest() in the car package. It assumes a null hypothesis of constant variance against the alternative hypothesis that the error variance changes with the level of the response or with a linear combination of predictors.
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(step.lm.fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 235.2819 Df = 1 p = 4.20249e-53
#lmtest::bptest(step.lm.fit)
# plot studentized residuals vs. fitted values
spreadLevelPlot(step.lm.fit)
## Warning in spreadLevelPlot.lm(step.lm.fit):
## 20 negative fitted values removed
##
## Suggested power transformation: 0.2302333
A very low p-value (~4.2e-53) means the null hypothesis can be rejected. In other words, there is a high chance that the errors have non-constant variance. The graph also shows how the spread of the absolute studentized residuals grows with the fitted values. One of the methods to fix this problem is transformation of the outcome variable.
bc <- boxcox(step.lm.fit, plotit = F)
powTrans <- bc$x[which.max(bc$y)]
chargesBCTrans <- BoxCoxTrans(insurance$charges)
chargesBCTrans
## Box-Cox Transformation
##
## 1338 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1122 4740 9382 13270 16640 63770
##
## Largest/Smallest: 56.8
## Sample Skewness: 1.51
##
## Estimated Lambda: 0
## With fudge factor, Lambda = 0 will be used for transformations
insurance$chargesTrans <- predict(chargesBCTrans, insurance$charges)
head(insurance)
## age sex bmi children smoker region charges chargesTrans
## 1 19 female 27.900 0 yes southwest 16884.924 9.734176
## 2 18 male 33.770 1 no southeast 1725.552 7.453302
## 3 28 male 33.000 3 no southeast 4449.462 8.400538
## 4 33 male 22.705 0 no northwest 21984.471 9.998092
## 5 32 male 28.880 0 no northwest 3866.855 8.260197
## 6 31 female 25.740 0 no southeast 3756.622 8.231275
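An estimated lambda of 0 means a log transformation: the Box-Cox transform (y^λ − 1)/λ approaches log(y) as λ → 0, which is why chargesTrans above equals the natural log of charges (e.g. log(16884.924) ≈ 9.734). A quick numeric check on the quartile values of charges:

```r
y <- c(1122, 4740, 9382, 16640, 63770)  # quartile values of charges
lambda <- 1e-6                           # lambda very close to 0
# (y^lambda - 1)/lambda converges to log(y) as lambda -> 0
all.equal((y^lambda - 1) / lambda, log(y), tolerance = 1e-4)
```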
trans.lm.fit <- update(step.lm.fit, charges^0.41~.)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(trans.lm.fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.008299572 Df = 1 p = 0.9274116
# plot studentized residuals vs. fitted values
spreadLevelPlot(trans.lm.fit)
##
## Suggested power transformation: 0.3611456
plot(insurance$age, insurance$charges)
update(trans.lm.fit, ~age+I(log(bmi))+I(children^0.5)+smoker+region) %>%
#ncvTest()
#summary()
residualPlots()
## Test stat Pr(>|Test stat|)
## age 1.6556 0.09804 .
## I(log(bmi)) -0.8427 0.39956
## I(children^0.5) 0.9385 0.34814
## smoker
## region
## Tukey test -4.5980 4.266e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm(charges~age+bmi+children+smoker, data = insurance) %>%
summary()
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11897.9 -2920.8 -986.6 1392.2 29509.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12102.77 941.98 -12.848 < 2e-16 ***
## age 257.85 11.90 21.675 < 2e-16 ***
## bmi 321.85 27.38 11.756 < 2e-16 ***
## children 473.50 137.79 3.436 0.000608 ***
## smokeryes 23811.40 411.22 57.904 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6068 on 1333 degrees of freedom
## Multiple R-squared: 0.7497, Adjusted R-squared: 0.7489
## F-statistic: 998.1 on 4 and 1333 DF, p-value: < 2.2e-16
tidy(lm.fit)
## term estimate std.error statistic p.value
## 1 (Intercept) -11938.5386 987.81918 -12.0857530 5.579044e-32
## 2 age 256.8564 11.89885 21.5866552 7.783217e-89
## 3 sexmale -131.3144 332.94544 -0.3944020 6.933475e-01
## 4 bmi 339.1935 28.59947 11.8601306 6.498194e-31
## 5 children 475.5005 137.80409 3.4505546 5.769682e-04
## 6 smokeryes 23848.5345 413.15335 57.7232020 0.000000e+00
## 7 regionnorthwest -352.9639 476.27579 -0.7410914 4.587689e-01
## 8 regionsoutheast -1035.0220 478.69221 -2.1621870 3.078174e-02
## 9 regionsouthwest -960.0510 477.93302 -2.0087563 4.476493e-02
plot(lm.fit)
# Assessing Outliers
outlierTest(lm.fit) # Bonferroni p-value for most extreme observations
## rstudent unadjusted p-value Bonferonni p
## 1301 5.009599 6.1887e-07 0.00082805
## 578 4.219800 2.6112e-05 0.03493800
qqPlot(lm.fit, main="QQ Plot") #qq plot for studentized resid
## [1] 1231 1301
leveragePlots(lm.fit) # leverage plots
lm.fit <- lm(log(charges)~age+bmi+children+smoker, data = insurance)
summary(lm.fit)
##
## Call:
## lm(formula = log(charges) ~ age + bmi + children + smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11340 -0.19883 -0.04688 0.07197 2.07581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.9827766 0.0697227 100.151 < 2e-16 ***
## age 0.0347826 0.0008805 39.502 < 2e-16 ***
## bmi 0.0106096 0.0020264 5.236 1.91e-07 ***
## children 0.1011976 0.0101989 9.922 < 2e-16 ***
## smokeryes 1.5432438 0.0304372 50.703 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4491 on 1333 degrees of freedom
## Multiple R-squared: 0.7622, Adjusted R-squared: 0.7614
## F-statistic: 1068 on 4 and 1333 DF, p-value: < 2.2e-16
plot(lm.fit)
# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(lm.fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 69.54001 Df = 1 p = 7.488006e-17
# plot studentized residuals vs. fitted values
spreadLevelPlot(lm.fit)
##
## Suggested power transformation: 1.253853
# Test for Autocorrelated Errors
durbinWatsonTest(lm.fit)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.02799903 2.054062 0.274
## Alternative hypothesis: rho != 0
# Assessing Outliers
outlierTest(lm.fit) # Bonferroni p-value for most extreme observations
## rstudent unadjusted p-value Bonferonni p
## 517 4.664824 3.3996e-06 0.0045487
## 220 4.643317 3.7673e-06 0.0050407
## 431 4.611197 4.3883e-06 0.0058715
## 103 4.583994 4.9899e-06 0.0066765
## 1028 4.493947 7.5979e-06 0.0101660
## 527 4.304437 1.7966e-05 0.0240390
## 1020 4.273503 2.0611e-05 0.0275780
## 1040 4.231808 2.4769e-05 0.0331410
qqPlot(lm.fit, main="QQ Plot") #qq plot for studentized resid
## [1] 220 517
leveragePlots(lm.fit) # leverage plots
vif(lm.fit)
## age bmi children smoker
## 1.014498 1.012194 1.001950 1.000745
# Evaluate Nonlinearity
# component + residual plot
crPlots(lm.fit)
# Ceres plots
ceresPlots(lm.fit)
## Warning in ceresPlots(lm.fit): Factors skipped in drawing CERES plots.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at -0.025
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.025
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 4.018e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at -0.025
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.025
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 4.018e-15
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 4
Sources:
1. https://www.kaggle.com/mirichoi0218/insurance/home
2. An Introduction to Statistical Learning (James, Witten, Hastie, and Tibshirani)
3. Wikipedia
4. https://www.statmethods.net/stats/rdiagnostics.html
5. https://www.statmethods.net/stats/regression.html
6. https://datascienceplus.com/how-to-detect-heteroscedasticity-and-rectify-it/